Abstract

This paper gives a brief survey of binary single-deletion-correcting 
codes. The Varshamov-Tenengolts codes appear to be optimal, but 
many interesting unsolved problems remain. The connections with 
shift-register sequences also remain somewhat mysterious. 

1 Introduction 

The possibility of packet loss on internet transmissions has renewed interest 
in deletion-correcting codes. (Of course there are many other applications 
of such codes, including magnetic recording, although in that case there are 
usually additional conditions that must be satisfied.) This paper considers 
the very simplest family of such codes, binary block codes capable of correct- 
ing single deletions. Even for these codes there remain several apparently 
unsolved problems. 

It is surprising, but these codes do not appear to be surveyed in any of
the usual references ([MS77], [PH98], etc.). This paper is a first attempt at
such a survey.

Proofs are given of a number of results, either because the new proofs
are simpler or because the original sources are hard to locate.^1

Definition 1.1. For a vector $u \in \mathbb{F}_q^n$, let $D_e(u)$ denote the set
of e-th order descendants, i.e. the set of vectors $v \in \mathbb{F}_q^{n-e}$ that
are obtained if e components are deleted from u. A subset
$C \subseteq \mathbb{F}_q^n$ is said to be an e-deletion-correcting code if
$D_e(u) \cap D_e(v) = \emptyset$ for all $u, v \in C$, $u \ne v$. Our problem is
to find the largest such code. In this paper we mostly consider the simplest
case, q = 2 and e = 1.



1 and when located are sometimes poorly translated or badly photocopied! 






The deletion distance dd(u, v) between vectors $u, v \in \mathbb{F}_q^n$ is defined
to be one-half of the smallest number of deletions and insertions needed to
change u to v. Then C is e-deletion-correcting if and only if
$dd(u, v) \ge e + 1$ for $u, v \in C$, $u \ne v$. (For $dd(u, v) \le e$ if and only
if there is a vector x that can be reached from u by at most e deletions and
also from v by at most e deletions, and then C cannot correct e deletions.)
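The definitions above can be tried out directly on small examples. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, words are represented as Python strings, and the distance routine relies on the standard fact that for two words of equal length n the smallest number of deletions plus insertions is 2(n - L), where L is the length of a longest common subsequence.

```python
from itertools import combinations


def descendants(u, e=1):
    """Return D_e(u): all distinct words obtained by deleting e symbols of u."""
    n = len(u)
    return {"".join(u[i] for i in keep) for keep in combinations(range(n), n - e)}


def lcs_length(u, v):
    """Length of a longest common subsequence (standard dynamic programme)."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def deletion_distance(u, v):
    """dd(u, v) for words of equal length n, computed as n minus the LCS length."""
    assert len(u) == len(v)
    return len(u) - lcs_length(u, v)


def is_deletion_correcting(code, e=1):
    """True if the descendant sets D_e(u) of distinct codewords are disjoint."""
    return all(
        descendants(u, e).isdisjoint(descendants(v, e))
        for u, v in combinations(list(code), 2)
    )
```

For instance, `is_deletion_correcting(["0000", "1001", "0110", "1111"])` checks the pairwise disjointness condition of Definition 1.1 for a four-word code of length 4.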

Consider the graph $G_n$ having a node for every vector $u \in \mathbb{F}_q^n$, with
an edge joining the nodes corresponding to $u, v \in \mathbb{F}_q^n$, $u \ne v$, if and
only if v can be obtained from u by a single deletion and insertion, i.e. if
and only if $D_1(u) \cap D_1(v) \ne \emptyset$. The deletion distance dd(u, v) is the
length of the shortest path from u to v (this shows that dd is indeed a metric).

In particular, a single-deletion-correcting code corresponds to an
independent set in $G_n$. One can now attempt to calculate the sizes of the
largest independent sets by computer. In the binary case we find that the
largest single-deletion-correcting codes of lengths 1, 2, ..., 8 have sizes

\[ 1,\ 2,\ 2,\ 4,\ 6,\ 10,\ 16,\ \ge 30 . \tag{1} \]

The last entry in (1) was kindly computed by my colleague David Johnson.
Unfortunately $G_8$ is too large for present computers and 30 is at present
only a lower bound on the size of a maximal independent set.^2
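For very short lengths the independent-set computation can be reproduced by exhaustive search. The sketch below is only an illustration under stated assumptions: it is a plain branch-and-bound over bitmasks with hypothetical function names, not the method used for the entries above, and it is practical only for small n.

```python
from itertools import product


def single_deletions(u):
    """D_1(u): the words obtained from u by deleting one symbol."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def largest_code_size(n):
    """Size of a largest binary single-deletion-correcting code of length n,
    found as a maximum independent set in the conflict graph G_n."""
    words = ["".join(bits) for bits in product("01", repeat=n)]
    balls = [single_deletions(w) for w in words]
    m = len(words)
    # conflict[i] is a bitmask of the words whose balls meet the ball of word i
    conflict = [
        sum(1 << j for j in range(m) if j != i and balls[i] & balls[j])
        for i in range(m)
    ]
    best = 0

    def search(candidates, size):
        nonlocal best
        # Even taking every remaining candidate cannot beat the current best.
        if size + bin(candidates).count("1") <= best:
            return
        if candidates == 0:
            best = size
            return
        i = candidates.bit_length() - 1
        # Branch 1: put word i in the code and discard everything it conflicts with.
        search(candidates & ~conflict[i] & ~(1 << i), size + 1)
        # Branch 2: leave word i out.
        search(candidates & ~(1 << i), size)

    search((1 << m) - 1, 0)
    return best
```

Calling `largest_code_size(n)` for n = 1, ..., 5 is a quick way to compare against the first five entries of (1).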

However, (1) turns out to be a useful hint. When one looks up this
sequence in [EIS], one finds a unique matching sequence, number A16, whose
initial terms $N_1, N_2, N_3, \ldots$ are

\[ 1, 1, 2, 2, 4, 6, 10, 16, 30, 52, 94, 172, 316, 586, \ldots \tag{2} \]

and whose nth term is given by

\[ N_n = \frac{1}{2n} \sum_{\text{odd } d \mid n} \phi(d)\, 2^{n/d}, \qquad n \ge 1, \tag{3} \]

where the sum is over all odd divisors d of n and $\phi$ is the Euler totient
function (sequence A10). The references cited for sequence A16 indicate
that it has arisen in connection with the enumeration of shift-register
sequences [Go67] and tournaments [Br80]. However there was (at that time)
no reference to indicate that this sequence has any connection with codes,
nor was there any apparent connection between the shift-register sequences
and deletion-correcting codes.
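Formula (3) is easy to evaluate numerically. The following sketch, with hypothetical helper names and a deliberately naive totient, computes $N_n$ as one-half of the average of $\phi(d)\,2^{n/d}$ over the odd divisors d of n, so that the initial terms in (2) can be regenerated.

```python
from math import gcd


def phi(m):
    """Euler totient function, computed naively (adequate for small m)."""
    return sum(1 for k in range(1, m + 1) if gcd(k, m) == 1)


def N(n):
    """n-th term of the sequence (2), evaluated from formula (3)."""
    total = sum(phi(d) * 2 ** (n // d) for d in range(1, n + 1, 2) if n % d == 0)
    assert total % (2 * n) == 0  # the formula always yields an integer
    return total // (2 * n)
```

For example, `[N(n) for n in range(1, 15)]` lists the first fourteen terms.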



2 Postscript: David Applegate has since used CPLEX's integer programming subrou- 
tines (which combine ordinary linear programming with branch-and-bound) to confirm 
that the largest single-deletion-correcting code of length 8 does indeed have size 30. 
More conventional search methods, in particular, consulting some
well-known papers of Levenshtein [Lev65], [Lev65a] on codes for correcting
deletions, turned up many other relevant references. Some of these will be
discussed further in Section 6. The most interesting codes are those of
Varshamov and Tenengolts [VT65]. In [VT65] they present a family of codes
depending on a certain parameter a. When a is taken to be 0, these codes
have size $N_{n+1}$ (see (2)) and thus match (1). These codes are the subject of
Section 2.

Sections 3 and 4 will discuss the connection with shift-registers and
tournaments, and Section 5 contains some general remarks about the number of
descendants of a vector. The final section, Section 6, gives a brief discussion
of other papers on deletion-correcting and related codes.



2 The Varshamov-Tenengolts codes

Definition 2.1. For $0 \le a \le n$, the Varshamov-Tenengolts code $VT_a(n)$
consists of all binary vectors $(x_1, \ldots, x_n)$ satisfying

\[ \sum_{i=1}^{n} i\,x_i \equiv a \pmod{n+1} , \tag{4} \]

where the sum is evaluated as an ordinary rational integer.



As will appear, the codes with a = 0 contain the most codewords. The
first few such codes are

\[ \begin{aligned}
VT_0(1) &= \{0\} \\
VT_0(2) &= \{00, 11\} \\
VT_0(3) &= \{000, 101\} \\
VT_0(4) &= \{0000, 1001, 0110, 1111\} \\
VT_0(5) &= \{00000, 10001, 01010, 11011, 11100, 00111\},
\end{aligned} \tag{5} \]

of sizes 1, 2, 2, 4, 6, matching (1) and (2). These codes were introduced in
[VT65] for correcting errors on a Z-channel (or asymmetric channel). Similar
constructions have been used in [ !] and also in [GS80] and j [81] to
construct constant weight codes.
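Definition 2.1 translates into a one-line membership test, so the small codes can be listed by brute force. This is a minimal sketch with a hypothetical function name; it enumerates all $2^n$ binary words and keeps those whose checksum $\sum i\,x_i$ is congruent to a modulo n + 1.

```python
from itertools import product


def vt_code(n, a=0):
    """All codewords of VT_a(n), as strings of '0' and '1'."""
    return [
        "".join(map(str, x))
        for x in product((0, 1), repeat=n)
        if sum(i * xi for i, xi in enumerate(x, start=1)) % (n + 1) == a % (n + 1)
    ]
```

For example, `vt_code(5)` returns the six words of $VT_0(5)$, and `len(vt_code(n))` gives the code sizes for small n.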

Levenshtein [Lev65], [Lev65a] observed that the Varshamov-Tenengolts
codes could be used for correcting single deletions, proving this by giving
the following elegant decoding algorithm.
Decoding algorithm

• Suppose a codeword $x = (x_1, \ldots, x_n) \in VT_a(n)$ is transmitted, the
symbol s in position p is deleted, and $x' = (x'_1, \ldots, x'_{n-1})$ is received.
Let there be $L_0$ 0's and $L_1$ 1's to the left of s, and $R_0$ 0's and $R_1$ 1's
to the right of s (with $p = 1 + L_0 + L_1$).

• We compute the weight $w = L_1 + R_1$ of x' and the new checksum
$\sum_{i=1}^{n-1} i\,x'_i$. If s = 0 the new checksum is $R_1$ ($\le w$) less than it was
before, and if s = 1 it is $p + R_1 = 1 + L_0 + L_1 + R_1 = 1 + w + L_0$ ($> w$)
less than it was before. (These numbers are less than n + 1 so there is
no ambiguity.)

• So if the deficiency in the checksum is less than or equal to w we know
that a 0 was deleted, and we restore it just to the left of the rightmost
$R_1$ 1's. Otherwise a 1 was deleted and we restore it just to the right
of the leftmost $L_0$ 0's.
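The three steps above can be sketched in code as follows. This is an illustrative implementation under stated assumptions, not code from the paper: words are strings of '0' and '1', the function names are hypothetical, and the deficiency is computed as (a minus the received checksum) modulo n + 1, which equals the drop in the checksum because that drop is less than n + 1.

```python
def vt_checksum(x):
    """Sum of i * x_i, with positions numbered from 1."""
    return sum(i for i, bit in enumerate(x, start=1) if bit == "1")


def vt_decode(received, n, a=0):
    """Recover the codeword of VT_a(n) from a word with one symbol deleted."""
    assert len(received) == n - 1
    w = received.count("1")
    deficiency = (a - vt_checksum(received)) % (n + 1)
    if deficiency <= w:
        # A 0 was deleted: reinsert it with exactly `deficiency` 1's to its right.
        pos, ones_seen = len(received), 0
        while ones_seen < deficiency:
            pos -= 1
            ones_seen += received[pos] == "1"
        return received[:pos] + "0" + received[pos:]
    # A 1 was deleted: reinsert it with exactly L_0 = deficiency - w - 1 zeros
    # to its left.
    zeros_needed = deficiency - w - 1
    pos, zeros_seen = 0, 0
    while zeros_seen < zeros_needed:
        zeros_seen += received[pos] == "0"
        pos += 1
    return received[:pos] + "1" + received[pos:]
```

As a usage example, deleting the third symbol of the codeword 10001 of $VT_0(5)$ leaves 1001, and `vt_decode("1001", 5)` reinserts the missing 0. The position of the reinserted symbol within a run is not unique, but every admissible position yields the same word.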



Table 1: Number of codewords in Varshamov-Tenengolts code $VT_a(n)$.

The sizes $|VT_a(n)|$ of the first few codes are shown in Table 1. (This
array forms sequence A53633 in [EIS].) These numbers were studied by
Varshamov [Var65] and Ginzburg [Gi67], but the following simple formula
appears to be new.

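Theorem 2.2.

\[ |VT_a(n)| = \frac{1}{2(n+1)} \sum_{\substack{d \mid n+1 \\ d \text{ odd}}}
   \phi(d)\, \frac{\mu\!\left(\frac{d}{(d,a)}\right)}{\phi\!\left(\frac{d}{(d,a)}\right)}\,
   2^{(n+1)/d} , \tag{6} \]

where $\mu(n)$ is the Möbius function (A8683), and (d, a) = gcd(d, a).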
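The formula of Theorem 2.2 can be compared with a direct count for small parameters. The sketch below assumes the formula in the form $|VT_a(n)| = \frac{1}{2(n+1)} \sum \phi(d)\,\mu(d/(d,a))/\phi(d/(d,a))\,2^{(n+1)/d}$, summed over the odd divisors d of n + 1; all function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import product
from math import gcd


def phi(m):
    """Euler totient function (naive)."""
    return sum(1 for k in range(1, m + 1) if gcd(k, m) == 1)


def mobius(m):
    """Moebius function, by trial division."""
    result, p = 1, 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0  # a squared prime factor
            result = -result
        p += 1
    return -result if m > 1 else result


def vt_size_formula(n, a):
    """|VT_a(n)| evaluated from the divisor sum of Theorem 2.2."""
    total = Fraction(0)
    for d in range(1, n + 2, 2):  # odd candidates for divisors of n + 1
        if (n + 1) % d == 0:
            q = d // gcd(d, a)  # gcd(d, 0) = d, so q = 1 when a = 0
            total += Fraction(phi(d) * mobius(q), phi(q)) * 2 ** ((n + 1) // d)
    return total / (2 * (n + 1))


def vt_size_brute(n, a):
    """|VT_a(n)| by enumerating all binary words of length n."""
    return sum(
        1
        for x in product((0, 1), repeat=n)
        if sum(i * xi for i, xi in enumerate(x, start=1)) % (n + 1) == a
    )
```

For example, with n = 5 both routines give 6 for a = 0 and a = 3, and 5 for the other four values of a, which add up to $2^5 = 32$.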
Proof. Write $w_a(n) = |VT_a(n)|$. We will calculate $w_a(n-1)$, assuming
throughout that $n \ge 1$. It follows from the definition of these codes that the
generating function

\[ f(z) = \sum_{a=0}^{n-1} w_a(n-1)\, z^a \]

is equal to

\[ \prod_{k=1}^{n-1} (1 + z^k) \bmod z^n - 1 . \]

Let $\xi = e^{2\pi i/n}$. Then

\[ f(\xi^j) = \sum_{a=0}^{n-1} w_a(n-1)\, \xi^{ja} = \prod_{k=1}^{n-1} (1 + \xi^{jk}),
   \qquad j = 0, \ldots, n-1 . \]

We solve this by taking an inverse discrete Fourier transform (cf. [Ko88],
Chap. 97) to obtain

\[ w_a(n-1) = \frac{1}{n} \sum_{j=0}^{n-1} f(\xi^j)\, \xi^{-ja} . \]

Since

\[ \prod_{k=0}^{n-1} (z - \xi^k) = z^n - 1 , \]

we can calculate $f(\xi^j)$ explicitly. An elementary calculation gives

\[ f(\xi^j) = \begin{cases}
   2^{g-1} & \text{if } d = n/g \text{ is odd,} \\
   0 & \text{if } d = n/g \text{ is even,}
   \end{cases} \]

where g = gcd(n, j). Therefore

\[ w_a(n-1) = \frac{1}{2n} \sum_{\substack{d \mid n \\ d \text{ odd}}} 2^{n/d}
   \sum_{\substack{j=1 \\ \gcd(n,j) = n/d}}^{n} \xi^{-ja} , \]

which becomes, writing j = kn/d,

\[ \frac{1}{2n} \sum_{\substack{d \mid n \\ d \text{ odd}}} 2^{n/d}
   \sum_{\substack{k=1 \\ (k,d) = 1}}^{d} e^{-2\pi i k a/d} . \]

The innermost sum is a Ramanujan sum $c_d(a)$ ([Ap76], p. 160), which
simplifies to

\[ c_d(a) = \phi(d)\, \frac{\mu\!\left(\frac{d}{(d,a)}\right)}{\phi\!\left(\frac{d}{(d,a)}\right)} \]

([Ap76], p. 164). Replacing n by n + 1 gives (6). □



Corollary 2.3.

(i)
\[ |VT_0(n)| = \frac{1}{2(n+1)} \sum_{\substack{d \mid n+1 \\ d \text{ odd}}} \phi(d)\, 2^{(n+1)/d} , \tag{7} \]

(ii)
\[ |VT_1(n)| = \frac{1}{2(n+1)} \sum_{\substack{d \mid n+1 \\ d \text{ odd}}} \mu(d)\, 2^{(n+1)/d} , \tag{8} \]

(iii) For any a,
\[ |VT_0(n)| \ge |VT_a(n)| \ge |VT_1(n)| . \tag{9} \]

Remark 2.4. (i) and the left-hand inequality in (iii) are due to Varshamov
[Var65], and (ii) and the right-hand inequality in (iii) to Ginzburg [Gi67].

Proof. (i) and (ii) follow immediately from Theorem 2.2, as does the
left-hand side of (iii) using $\mu(k) \le \phi(k)$ for all k. To establish the right-hand
side of (iii), let p be the smallest odd prime dividing both n + 1 and a (if no
such prime exists then $|VT_a(n)| = |VT_1(n)|$). The terms in the expressions
for $|VT_a(n)|$ and $|VT_1(n)|$ agree for d < p, and at d = p the term in $|VT_a(n)|$
exceeds that in $|VT_1(n)|$ by $p\,2^{(n+1)/p}$. It is easy to check that the remaining
terms can never make the sum in $|VT_1(n)|$ catch up with the sum in $|VT_a(n)|$.

□



Optimality 

It is more difficult to obtain upper bounds for deletion-correcting codes
than for conventional error-correcting codes, since the disjoint balls $D_e(u)$
associated with the codewords (see Section 5) do not all have the same size.
Furthermore the metric space $(\mathbb{F}_2^n, dd)$ is not an association scheme and so
there is no obvious linear programming bound.

The size of $D_1(u)$ is easily seen to be equal to r(u), the number of runs
in u. Furthermore the number of vectors in $\mathbb{F}_2^n$ with r runs is
$2\binom{n-1}{r-1}$. (We will discuss $|D_e(u)|$ further in Section 5.)
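Both facts in the preceding paragraph can be confirmed by enumeration for small n. The sketch below uses hypothetical helper names; it compares the number of single-deletion descendants of each word with its number of runs, and tabulates how many words of length n have exactly r runs.

```python
from itertools import product
from math import comb


def runs(u):
    """Number of maximal blocks of equal symbols in a non-empty word."""
    return 1 + sum(a != b for a, b in zip(u, u[1:]))


def single_deletions(u):
    """D_1(u): the words obtained from u by deleting one symbol."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def run_census(n):
    """Map r -> number of binary words of length n with exactly r runs."""
    census = {}
    for bits in product("01", repeat=n):
        r = runs("".join(bits))
        census[r] = census.get(r, 0) + 1
    return census
```

For example, `run_census(4)` can be compared entry by entry with the values $2\binom{3}{r-1}$ for r = 1, ..., 4.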

Let A(n, e) denote the size of the largest e-deletion-correcting binary
code of length n, and call a code C optimal if |C| = A(n, e). The values of
A(n, 1) for n < 9 were given in Section 1, and show that $VT_0(n)$ is optimal
for n < 9.

For large n, the codes $VT_0(n)$ are certainly close to being optimal, since
on the one hand we have

\[ |VT_0(n)| \ge \frac{2^n}{n+1} , \tag{10} \]

from (7), and on the other hand we have the following result of Levenshtein
[Lev65]:

Theorem 2.5 ([Lev65]).

\[ A(n, 1) \sim \frac{2^n}{n} , \qquad \text{as } n \to \infty . \]

Proof. (10) gives a lower bound. Let C be an optimal code. Following
Levenshtein, let $C_0$ denote the subset of C consisting of the vectors $u \in C$
with

\[ \frac{n}{2} - \sqrt{n \log n} < r(u) < \frac{n}{2} + \sqrt{n \log n} \]

and let $C_1 = C \setminus C_0$. Since the sets $D_1(u)$, $u \in C$, must be disjoint,

\[ |C_0| \le \frac{2^{n-1}}{\frac{n}{2} - \sqrt{n \log n}} \sim \frac{2^n}{n} . \]

Furthermore,

\[ |C_1| \le \sum_{r \,:\, |r - n/2| \ge \sqrt{n \log n}} 2 \binom{n-1}{r-1} , \]

which is much smaller than $2^n/n$. □



In a later paper, Levenshtein [Lev92] defines a code C to be perfect if
the balls $D_e(u)$, $u \in C$, partition the set $\mathbb{F}_2^{n-e}$. In [Lev92] he proves the
remarkable fact that all the codes $VT_0(n), VT_1(n), VT_2(n), \ldots$ are perfect
single-deletion-correcting codes. The argument, not reproduced here, is
essentially just a refinement of the decoding algorithm for these codes given
above.
It is initially surprising that perfect codes of the same length can have
different numbers of codewords, but this is explained by the fact that the
balls $D_1(u)$ have different sizes.

In view of this and the result in (9), it is tempting to make the following
conjecture.

Conjecture 2.6. The codes $VT_0(n)$ are optimal for all n.

This is true for $n \le 8$, as already mentioned, but for larger n it is possible
that other, smaller, perfect codes may exist, or even that smaller, optimal
but non-perfect codes may exist.

Indeed, consider the code {000, 111}. For this code,
$\sum_{u \in C} |D_1(u)| = 1 + 1 = 2 < 4$, so this is optimal but not perfect. For
length 4, {0000, 0011, 1100, 1111} contains as many codewords as $VT_0(4)$
(compare (5)), and again is optimal but not perfect.

At length 6 it is possible to replace two codewords of $VT_0(6)$ by two
other vectors without affecting its ability to correct single deletions: 110100
and 001011 can be replaced by 111000 and 000111. The former pair cover
eight vectors of length 5, but the latter only cover four vectors of length 5,
leaving four vectors uncovered. This suggests the possibility that in some
larger code $VT_0(n)$ it may be possible to replace k vectors by k + 1 vectors,
which would prove that these codes are not optimal.
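The perfectness property and the length-6 replacement can both be checked mechanically. The following is a small sketch with hypothetical helper names: `coverage` reports whether the balls $D_1(u)$ of a code are pairwise disjoint and which words of length n - 1 they cover, so a code is perfect exactly when the balls are disjoint and cover all $2^{n-1}$ words.

```python
from itertools import product


def single_deletions(u):
    """D_1(u): the words obtained from u by deleting one symbol."""
    return {u[:i] + u[i + 1:] for i in range(len(u))}


def vt_code(n, a=0):
    """Codewords of VT_a(n) as strings, by brute-force enumeration."""
    words = ("".join(b) for b in product("01", repeat=n))
    return [
        x for x in words
        if sum(i for i, bit in enumerate(x, start=1) if bit == "1") % (n + 1) == a
    ]


def coverage(code):
    """Return (disjoint, covered): whether the balls D_1(u) are pairwise
    disjoint, and the set of shortened words that they cover."""
    covered, total = set(), 0
    for u in code:
        ball = single_deletions(u)
        total += len(ball)
        covered |= ball
    return total == len(covered), covered


def swapped_vt0_6():
    """VT_0(6) with 110100, 001011 replaced by 111000, 000111."""
    code = set(vt_code(6))
    return (code - {"110100", "001011"}) | {"111000", "000111"}
```

Applying `coverage` to `vt_code(6)` and to `swapped_vt0_6()` shows the difference described above: both are single-deletion-correcting codes with ten words, but the second leaves four words of length 5 uncovered.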

In view of these remarks, Conjecture 2.6 does not seem especially
compelling!



Linearity 

As can be seen from (5), the codes $VT_0(n)$ are linear for $n \le 4$. They are
never again linear, since, for $n \ge 5$, $VT_0(n)$ contains the vectors 10...01
and 110...0100 but not their sum.

In particular, even though $|VT_0(7)| = 16$, this code is not linear. One
might wonder if it is possible to find a linear code that will do as well, but
a computer search has shown that no such code exists.

On the other hand, by adapting a construction of Tenengolts [Ten76], one
can modify the Varshamov-Tenengolts construction to obtain linear codes,
with only a small increase in the length of the code.

Definition 2.7. Given $k \ge 1$, let

\[ n = k + \left\lceil \sqrt{2k + 9/4} + 1/2 \right\rceil . \]

The linear single-deletion-correcting code $VT'_0(n)$ has dimension k and
consists of all vectors $(x_1, \ldots, x_n) \in \mathbb{F}_2^n$, where $x_1, \ldots, x_k$ are information
symbols and the c = n − k check symbols $x_{k+1}, \ldots, x_n$ are chosen so that
$\sum_{i=1}^{n} i\,x_i \equiv 0 \pmod{n+1}$.

The construction works because c is just large enough so that
$\binom{c+1}{2} \ge n + 1$, and so the sums $\sum_{i=k+1}^{n} i\,x_i$ cover n + 1 consecutive
values modulo n + 1. We omit the details.
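The choice of the number of check symbols can be illustrated with a little arithmetic. The sketch below is an assumption-laden illustration rather than part of the construction: it takes c = n − k to be $\lceil \sqrt{2k + 9/4} + 1/2 \rceil$ and takes the condition on c to be $\binom{c+1}{2} \ge n + 1 = k + c + 1$, and then checks that this c is the least integer satisfying the condition. The function names are hypothetical, and integer square roots are used to avoid floating-point rounding.

```python
from math import comb, isqrt


def check_symbols(k):
    """c = ceil(sqrt(2k + 9/4) + 1/2), i.e. ceil((sqrt(8k + 9) + 1) / 2),
    computed exactly with integer arithmetic."""
    m = 8 * k + 9          # always odd, so an exact square root is odd too
    s = isqrt(m)
    if s * s == m:
        return (s + 1) // 2
    return (s + 1) // 2 + 1


def least_c(k):
    """Smallest c with binom(c + 1, 2) >= k + c + 1, found by direct search."""
    c = 1
    while comb(c + 1, 2) < k + c + 1:
        c += 1
    return c
```

For instance, k = 2 gives c = 3 and n = 5, since $\binom{4}{2} = 6 = n + 1$ while c = 2 would give only $\binom{3}{2} = 3$.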

The number of check symbols in these codes is of the order of $\sqrt{2n}$,
compared with O(log n) for the $VT_0(n)$ codes. So we end this section with a
final question: What are the optimal linear single-deletion-correcting codes?



3 Shift register sequences 



As mentioned in Section 1, the entry for sequence A16 in [EIS] indicates
that these numbers also arise in the enumeration of shift register sequences
[Go67]. We will show here that indeed this is the same sequence. But
whether this is anything more than a coincidence remains an open question.
Of course there are well-known connections between shift-register sequences
and conventional error-correcting codes (cf. [MS77], Chapter 7), so there
should be a deeper explanation.

The context in which sequence A16 appears in Golomb's book [Go67]
is the enumeration of the (infinite) output sequences from certain types of
n-stage binary shift registers. We consider four kinds of shift registers: the
pure cycling register (or PCR), as illustrated in Fig. 1, the complemented
cycling register (or CCR), the pure summing register (or PSR) and the
complemented summing register (or CSR). If the shift register has n cells,
initially containing $x_1, x_2, \ldots, x_n$ ($x_i$ = 0 or 1), then $x_1$ is appended to the
output stream, symbols $x_2, \ldots, x_n$ move to the left, and the symbol

(PCR) $x_1$

(CCR) $1 + x_1$

(PSR) $x_1 + x_2 + \cdots + x_n$ or

(CSR) $1 + x_1 + x_2 + \cdots + x_n$

is fed back to the right-most cell.

Figure 1: An n-stage pure cycling register.

The problem is to determine the numbers of different possible output
sequences from these registers, which we denote by Z(n), Z*(n), S(n) and
S*(n), respectively. For example S*(5) = 6, corresponding to the sequences

000001000001
000111000111
001011001011
010011010011
010101010101
011111011111

all having period 6 (or a divisor of 6).
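The four registers are simple enough to simulate. The sketch below is an illustration with hypothetical names: it treats two output sequences as the same when one is a shift of the other, so that counting output sequences amounts to counting the cycles of the state-transition map, which is a bijection on the $2^n$ states for each of the four feedback rules.

```python
from itertools import product

# Feedback symbol as a function of the current state (x_1, ..., x_n).
FEEDBACK = {
    "PCR": lambda x: x[0],
    "CCR": lambda x: 1 ^ x[0],
    "PSR": lambda x: sum(x) % 2,
    "CSR": lambda x: 1 ^ (sum(x) % 2),
}


def count_output_sequences(kind, n):
    """Number of cycles of the n-stage register of the given kind."""
    feedback = FEEDBACK[kind]
    seen, cycles = set(), 0
    for start in product((0, 1), repeat=n):
        if start in seen:
            continue
        cycles += 1
        state = start
        while state not in seen:  # walk once around the cycle through `start`
            seen.add(state)
            state = state[1:] + (feedback(state),)
    return cycles
```

For example, `count_output_sequences("CSR", 5)` counts the six sequences listed above.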

Table 2, based on [Go67, page 172], shows the first few values of these
functions, together with the corresponding sequence numbers from [EIS].

Explicit formulas for these functions are given in the next theorem. 



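Theorem 3.1. For $n \ge 1$,

\[ Z(n) = \frac{1}{n} \sum_{d \mid n} \phi(d)\, 2^{n/d} , \tag{11} \]

\[ Z^*(n) = S^*(n-1) = \frac{1}{2n} \sum_{\substack{d \mid n \\ d \text{ odd}}} \phi(d)\, 2^{n/d} , \tag{12} \]

\[ S(n) = \frac{1}{2(n+1)} \sum_{d \mid n+1} \phi(2d)\, 2^{(n+1)/d} . \tag{13} \]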

Remark 3.2. Golomb proves (11) and sketches proofs of the other results.
Actually (13) is due to Michael Somos (personal communication), Golomb's
version (given in (15) below) being slightly more complicated. The numbers
Z(n) (sequence A31) in the first column are also familiar as the number
of binary irreducible polynomials of degree dividing n, and the number of
n-bead necklaces formed with beads of two colors, when the necklaces may
not be turned over (cf. fBd|, Chap. 4], [GR61], [MS77, Chap. 4], [St99,
Problem 7.112]). Fredricksen [Fr70] shows that Z(n) − 1 is the number of
1's in the truth table defining the lexicographically least de Bruijn cycle.

Table 2: Number of output sequences from n-stage shift registers of types
PCR, CCR, PSR, CSR.

              PCR     CCR     PSR     CSR
    n         Z(n)    Z*(n)   S(n)    S*(n)
    6         14      6       10      10
    7         20      10      20      16
    8         36      16      30      30
    9         60      30      56      52
    10        108     52      94      94
    Sequence: A31     A16     A13     A16



Proof. Note that sequence A16 appears in two places in the table, for CCR
registers of length n and CSR registers of length n − 1. We begin by
explaining this, and thus proving that

\[ Z^*(n) = S^*(n-1) . \tag{14} \]

Suppose for concreteness that n = 4. The output sequences from the four
types of register are (omitting plus signs, and writing 1a rather than 1 + a,
etc.):

(i) a b c d a b c d a ...

(ii) a b c d 1a 1b 1c 1d a b c d ...

(iii) a b c d abcd a b c d abcd a b c d ...

(iv) a b c d 1abcd a b c d 1abcd a b c d ...

In general these sequences have periods n, 2n, n + 1 and n + 1, respectively.
If we replace (ii) by the sums of adjacent pairs we get

ab bc cd 1ad ab bc cd 1ad ... ,

a CSR(3) sequence. Conversely, given a CSR(3) sequence, say

A B C 1ABC A B C 1ABC ...

of period 4, there is a unique CCR(4) sequence of period 8 corresponding to
it, namely

0 A AB ABC 1 1A 1AB 1ABC 0 A ... .

Applying this argument in the general case establishes (14).
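The passage from a CCR(n) sequence to a CSR(n − 1) sequence by summing adjacent pairs can be checked by simulation. This is a small sketch with hypothetical names: it generates the output of a CCR register from an arbitrary starting state, forms the sums (mod 2) of adjacent output symbols, and compares the result with the output of a CSR register started from the adjacent sums of the original state.

```python
from itertools import product


def register_output(state, feedback, length):
    """First `length` output symbols of a shift register with the given feedback."""
    state, out = tuple(state), []
    for _ in range(length):
        out.append(state[0])
        state = state[1:] + (feedback(state),)
    return out


def ccr(state):
    """Feedback of the complemented cycling register: 1 + x_1."""
    return 1 ^ state[0]


def csr(state):
    """Feedback of the complemented summing register: 1 + x_1 + ... + x_n."""
    return 1 ^ (sum(state) % 2)


def adjacent_sums(seq):
    """Sums modulo 2 of adjacent pairs of a sequence."""
    return [a ^ b for a, b in zip(seq, seq[1:])]
```

For instance, with state (a, b, c, d) = (0, 1, 1, 0), `adjacent_sums(register_output((0, 1, 1, 0), ccr, 9))` can be compared with `register_output((1, 0, 1), csr, 8)`.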

In the rest of the proof we make use of Burnside's lemma (cf. [St99]),
which states that the number of orbits of a finite permutation group G is
equal to the average number of points that are fixed by the elements of G.

Let us first prove (11). (This is Golomb's proof [Go67, p. 121].) We take
G to be the cyclic group of order n generated by $\pi = (1, 2, \ldots, n)$, acting
on $\mathbb{F}_2^n$. The permutation $\pi^i$ ($1 \le i \le n$) contains gcd(n, i) cycles, each of
length n/gcd(n, i), and has order n/gcd(n, i). There are precisely $2^{\gcd(n,i)}$
vectors fixed by $\pi^i$, since each cycle must consist of all 0's or all 1's. Hence,
by Burnside's lemma,

\[ \begin{aligned}
Z(n) &= \frac{1}{n} \sum_{i=1}^{n} 2^{\gcd(n,i)} \\
     &= \frac{1}{n} \sum_{k \mid n} 2^k \sum_{\substack{i=1 \\ \gcd(n,i) = k}}^{n} 1 \\
     &= \frac{1}{n} \sum_{k \mid n} 2^k \sum_{\substack{i=1 \\ \gcd(n/k,\, i) = 1}}^{n/k} 1 \\
     &= \frac{1}{n} \sum_{k \mid n} 2^k\, \phi\!\left(\frac{n}{k}\right) \\
     &= \frac{1}{n} \sum_{d \mid n} \phi(d)\, 2^{n/d} .
\end{aligned} \]

To establish (12), we note from (iv) that S*(n − 1) is equal to the number
of orbits of the same group, but now acting on binary vectors of length n
and odd weight. The number of odd weight vectors fixed by $\pi^i$ is $2^{\gcd(n,i)-1}$
if the cycle lengths n/gcd(n, i) are odd, and zero otherwise. Hence

\[ \begin{aligned}
S^*(n-1) &= \frac{1}{n} \sum_{\substack{i=1 \\ n/\gcd(n,i) \text{ odd}}}^{n} 2^{\gcd(n,i)-1} \\
         &= \frac{1}{n} \sum_{\substack{k \mid n \\ n/k \text{ odd}}} \phi\!\left(\frac{n}{k}\right) 2^{k-1} \\
         &= \frac{1}{2n} \sum_{\substack{d \mid n \\ d \text{ odd}}} \phi(d)\, 2^{n/d} .
\end{aligned} \]

Finally, we prove (13), by determining S(n − 1). The group is the same,
but now (see (iii)) acting on even weight vectors. If d = n/gcd(n, i) is even
there are $2^{n/d}$ fixed vectors, but if d is odd only $2^{n/d-1}$ fixed vectors. Hence

\[ \begin{aligned}
S(n-1) &= \frac{1}{n} \sum_{\substack{d \mid n \\ d \text{ odd}}} \phi(d)\, 2^{n/d-1}
        + \frac{1}{n} \sum_{\substack{d \mid n \\ d \text{ even}}} \phi(d)\, 2^{n/d} \qquad (15) \\
       &= \frac{1}{2n} \sum_{d \mid n} \phi(2d)\, 2^{n/d} ,
\end{aligned} \]

since $\phi(2d) = \phi(d)$ if d odd, $\phi(2d) = 2\phi(d)$ if d even. □

But a mystery still remains: is the fact that the number of codewords in
$VT_0(n)$ equals $Z^*(n+1)$ just a numerical coincidence, or is there a one-to-one
correspondence between the codewords and the CCR shift register sequences?
(This is essentially equivalent to a research problem stated by Stanley in
[St86], Chapter 1, Problem 27(c).)

Furthermore, why is $|VT_1(n)|$ (sequence A48 in [EIS]) equal to the number
of (n + 1)-bead necklaces with beads of two colors and primitive period
n + 1, when the two colors may be interchanged but the necklaces may not
be turned over (cf. [Fi58], [GR61])? This is also the number of irreducible
polynomials over $\mathbb{F}_2$ of degree n + 1 in which the coefficient of $x^n$ is 1 [Car52],
fpMRSS I



4 Locally transitive tournaments 



The entry for A16 in [EIS] also indicates that this sequence arose in Brouwer's
enumeration [Br80] of locally transitive tournaments. A tournament is a
directed graph with one directed edge between any two nodes. It is transitive
if there are no directed cycles.

A locally transitive tournament is a tournament such that the subgraphs
on the predecessors of a point and the successors of a point are both
transitive.

Brouwer, answering a question raised by P. J. Cameron, determined the
number of locally transitive tournaments on n nodes. He began by
calculating the first few values by computer. Then he looked up this sequence
in [EIS], and found the reference to Golomb's book [Go67]. With this hint
alone, and without having access to the book, he established a one-to-one
correspondence between these tournaments and output sequences from shift
registers of CCR type. From this he obtained the formula

d\n e |» 

where odd(i) is 0 or 1 according to whether i is even or odd, and $\mu$ is the
Möbius function (A8683). Using the identity

\[ \phi(n) = \sum_{d \mid n} \mu(d)\, \frac{n}{d} \]

([Ap76], p. 26), (16) immediately reduces to (12).

Again we can ask, is there a connection between locally transitive
tournaments and the $VT_0(n)$ codes?

5 The number of descendants of a vector 

It was already mentioned in Section 2 that $|D_1(u)| = r(u)$, the number of
runs in u.

The next theorem was discovered by E. M. Rains and the author.
Although this must be well-known, we have not found it in the literature.
The derivative $u' \in \mathbb{F}_2^{n-1}$ of $u = (u_1, \ldots, u_n) \in \mathbb{F}_2^n$ is given by

\[ u' = (u_1 + u_2,\ u_2 + u_3,\ \ldots,\ u_{n-1} + u_n) . \]

Note that wt(u') = r(u) − 1.

Theorem 5.1.

\[ |D_2(u)| = \binom{r(u)+1}{2} - \delta , \tag{17} \]

where $\delta = 2\,\mathrm{wt}(u') - \mathrm{wt}(u'')$ is the deficiency of u.
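Theorem 5.1 lends itself to an exhaustive check for short vectors. The sketch below is an illustration with hypothetical names, assuming the statement $|D_2(u)| = \binom{r(u)+1}{2} - \delta$ with $\delta = 2\,\mathrm{wt}(u') - \mathrm{wt}(u'')$; vectors are tuples of 0's and 1's, and the derivative is formed by adding adjacent components modulo 2.

```python
from itertools import combinations, product
from math import comb


def descendants(u, e):
    """D_e(u): all distinct vectors obtained by deleting e components of u."""
    n = len(u)
    return {tuple(u[i] for i in keep) for keep in combinations(range(n), n - e)}


def derivative(u):
    """u' = (u_1 + u_2, ..., u_{n-1} + u_n), with addition modulo 2."""
    return tuple(a ^ b for a, b in zip(u, u[1:]))


def d2_formula(u):
    """Right-hand side of (17) for a binary vector u of length at least 2."""
    d1 = derivative(u)
    d2 = derivative(d1)
    r = 1 + sum(d1)                      # number of runs, since wt(u') = r(u) - 1
    delta = 2 * sum(d1) - sum(d2)        # the deficiency of u
    return comb(r + 1, 2) - delta
```

For example, for u = 00001000 the derivative has weight 2 and the second derivative has weight 2, so the deficiency is 2 and the formula gives $\binom{4}{2} - 2 = 4$, which is also the number of distinct vectors returned by `descendants(u, 2)`.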
Sketch of proof. First, suppose u is a "normal" vector, meaning that all runs
have length $\ge 2$, for example

\[ \begin{aligned}
u   &= 0000111000 \\
u'  &= 000100100 \\
u'' &= 00110110
\end{aligned} \tag{18} \]

$\binom{r(u)+1}{2}$ is the number of ways of choosing two things
out of r(u) with repetitions allowed. If the runs in u have lengths i, j, k, l, ...,
the runs in the shortened vector have lengths

\[ \begin{array}{llll}
i-2, & j, & k, & l, \ \ldots \\
i, & j-2, & k, & l, \ \ldots \\
i, & j, & k-2, & l, \ \ldots \\
\ldots \\
i-1, & j-1, & k, & l, \ \ldots \\
i-1, & j, & k-1, & l, \ \ldots \\
\ldots
\end{array} \tag{19} \]

For a normal vector wt(u'') = 2wt(u') (cf. (18)), $\delta = 0$ and (17) holds.

Next suppose that all runs in u have length $\ge 2$ except for a single
internal run of length 1, as in

\[ \begin{aligned}
u   &= 00001000 \\
u'  &= 0001100 \\
u'' &= 001010
\end{aligned} \]

Then $\delta = 2$, and indeed $|D_2(u)|$ is 2 less than it would be for a normal vector,
since one of the possibilities in (19) vanishes and two others coalesce.

The remaining cases, when there are several runs of length 1, possibly
including beginning or ending runs, are left to the reader. □

It is not clear how to generalize Theorem 5.1 to k-th order descendants.
Certainly $|D_3(u)|$ is not simply a function of the weights of u, u', u'' and u'''.

Theorem 5.2. Let

\[ \mu_k(n) = \max_{u \in \mathbb{F}_2^n} |D_k(u)| \]

be the maximal number of k-th order descendants of any binary vector of
length n. Then

\[ \mu_k(n) = \sum_{i=0}^{k} \binom{n-k}{i} \tag{20} \]

for $n \ge k + 1$. Equality is achieved just by the vectors

\[ 010101\ldots \quad \text{and} \quad 101010\ldots \ . \tag{21} \]

According to Calabi and Hartnett [CH69], (20) is proved in an unpublished
1967 report^3 of Calabi [Cal67]. The first published proof seems to
have been given by Levenshtein [Lev96]. It was generalized to the nonbinary
case by Hirschberg [Hir99] (see also Levenshtein [Lev01] and Hirschberg
and Regnier [HR00]).

It is not difficult to show that the vectors (21) achieve the bound in (20).

Theorem 5.3. For the two vectors 010101... and 101010... we have

\[ |D_k(u)| = \sum_{i=0}^{k} \binom{n-k}{i} . \tag{22} \]

Proof. Let $u = 010101\ldots \in \mathbb{F}_2^n$, let $M_{n,k}$ be the set of k-th order descendants
of u, and let $m_{n,k} = |M_{n,k}|$. Then

\[ \begin{aligned}
M_{n,k} &= 0|\overline{M}_{n-1,k} \ \cup\ \overline{M}_{n-1,k-1} \\
        &= 0|\overline{M}_{n-1,k} \ \cup\ 1|M_{n-2,k-1} \ \cup\ M_{n-2,k-2}
\end{aligned} \tag{23} \]

where the bars denote binary complementation. However, the last term in
(23) can be dropped because it is contained in the union of the other two
terms. Since these two terms are disjoint, we have

\[ m_{n,k} = m_{n-1,k} + m_{n-2,k-1} . \]

This is a disguised version of the recurrence for binomial coefficients, whose
solution is given by (22). □
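The maximum in Theorem 5.2 and the count in Theorem 5.3 can be compared by exhaustive search for short lengths. The sketch below uses hypothetical names and assumes the bound in the form $\sum_{i=0}^{k} \binom{n-k}{i}$; it computes the largest number of k-th order descendants over all binary vectors of length n and the number of descendants of the alternating vector 0101....

```python
from itertools import combinations, product
from math import comb


def descendants(u, k):
    """D_k(u): all distinct vectors obtained by deleting k components of u."""
    n = len(u)
    return {tuple(u[i] for i in keep) for keep in combinations(range(n), n - k)}


def mu(n, k):
    """Maximum of |D_k(u)| over all binary u of length n (exhaustive search)."""
    return max(len(descendants(u, k)) for u in product((0, 1), repeat=n))


def bound(n, k):
    """The sum of binomial coefficients binom(n - k, i) for i = 0, ..., k."""
    return sum(comb(n - k, i) for i in range(k + 1))


def alternating(n):
    """The vector 0101... of length n."""
    return tuple(i % 2 for i in range(n))
```

For example, `bound(6, 2)` is 1 + 4 + 6 = 11, which can be compared with `mu(6, 2)` and with `len(descendants(alternating(6), 2))`.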



The case k = 2 of (20) is a corollary of Theorem 5.1.

Corollary 5.4. For $n \ge 3$,

\[ \mu_2(n) = \sum_{i=0}^{2} \binom{n-2}{i} = \frac{1}{2}\,(n^2 - 3n + 4) . \tag{24} \]

Proof. Let u achieve $\mu_2(n)$. The result is easily verified if r(u) is 1 or 2, so
we assume $r(u) \ge 3$.

Suppose u begins with a string of $k \ge 0$ runs of length 1, followed by a run
of length $\ge 2$ from position k + 1. We will show that the vector u* obtained
by complementing u from position k + 2 onwards satisfies $|D_2(u^*)| \ge |D_2(u)|$.
By repeating this operation we eventually arrive at one of the vectors (21).

Since $|D_k(\bar{u})| = |D_k(u)|$, we may assume that the run following the initial
k runs of length 1 in u begins 11x.... In u* this is replaced by $10\bar{x}\ldots$.
Then we find that u* has r(u) + 1 runs, and wt(u*'') = wt(u'') − 2 + 2x, from
which it follows using (17) that $|D_2(u^*)| - |D_2(u)| = r(u) + 2x - 3 \ge 0$, as
required. □

3 I have been unable to locate a copy of this report.



6 Related work 

The history of deletion-correcting codes is closely tied up with studies of 
codes for correcting other classes of errors such as: 

• erasures, when bits whose positions are known are deleted 

• insertions of bits (rather than deletions) 

• asymmetric errors, when the only errors that occur are that 1's may
be changed to 0's (this is also known as the Z-channel)

• unidirectional errors: 0's may be changed to l's or l's to 0's, but only 
one type of error occurs in any particular transmission 

• bit reversals: 0's may be changed to l's or vice versa — this is the 
subject of classical coding theory 

• transpositions: adjacent bits may be swapped 

• any meaningful combination of the above. 

Furthermore the alphabet may be changed from $\mathbb{F}_2$ to $\mathbb{F}_q$. This produces
an extensive list of families of codes, and of course in each case one can ask
for the largest codes.

In this section we give a brief overview of some other relevant papers. 



First, Levenshtein's papers [Lev65], [Lev65a], [Lev92], [Lev01] should be
considered essential reading.

Hartnett [ | (see especially Calabi and Hartnett [CH69]) contains
some general investigations of all the above-mentioned codes (both block
codes and variable length codes) from a fairly abstract mathematical point
of view.

One of the earliest papers to study deletion-correcting codes is Sellers
[Se62], which combines a special separating string between blocks with a
burst-error correcting code inside the blocks.

Ullman [Ull66] uses a construction similar to that of Varshamov and
Tenengolts, but his codes are not as efficient and also use a separating string
between blocks. In [Ull67] he gives bounds on the size of codes for correcting
synchronization errors.

Tenengolts [Ten84] generalizes the $VT_a(n)$ codes to larger alphabets.
Nonbinary codes are also discussed in fBo9i , flBo95j |5o8E|, jMa9|].

Other constructions for deletion-correcting and related codes are given
by Calabi and Hartnett [CH69a], Iizuka, Kasahara and Namekawa [IKN],
Kløve [Kl95] and Tanaka and Kasai [TK76].

The most recent paper on this subject is by Schulman and Zuckerman 
!Z9£], who present what they describe as "simple, polynomial-time en- 
codable and decodable codes which are asymptotically good for channels 
allowing insertions, deletions and transpositions". The number of errors 
that can be corrected is some constant fraction of the block-length n. The 
constructions are not explicit. 

We conclude this section by mentioning some papers on peripherally
related codes. Codes for correcting asymmetric and unidirectional errors
are discussed in pR82||, p9l, p098|, [WVB88] and [WVB89]. Erasure
correcting codes are discussed by Alon and Luby [AL96] and Barg |Ba£